Abstract

Quantum classification is defined as the task of predicting the associated class of an unknown quantum 
state drawn from an ensemble of pure states given a finite number of copies of this state. By recasting 
the state discrimination problem within the framework of Machine Learning (ML), we can use the notion
of learning reduction coming from classical ML to solve different variants of the classification task, such 
as the weighted binary and the multiclass versions. 

1 Introduction 

Suppose that you are given an unknown quantum state drawn from an ensemble of possible pure states 
where each state is labeled after the class from which it originated. How well can you predict the class 
of this unknown state? This general question is often referred to in the literature as (quantum) state 
discrimination^1 [6] and has been studied at least as far back as the seminal work of Helstrom in the seventies
in the field of quantum detection and estimation theory [17]. Of course, the answer will depend on parameters 
such as the structure and your knowledge of the ensemble of pure states, the dimension of the Hilbert space 
in which the quantum states live and the number of copies of the unknown state you received. 

In this paper, we take a Machine Learning (ML) view of the problem by recasting it as a learning task 
called quantum classification. Our main goal by doing so is to bring new ideas and insights from ML to help 
solve this task and some of its variants. Other motivations include the characterization of these learning 
tasks in terms of the amount of information needed to complete them (measured for instance by the number 
of copies of the quantum states) and the development of a framework that can be used to relate and compare 
these tasks. 

This approach of performing learning on quantum states was originally taken and defined in [2], where it
was illustrated by giving an explicit algorithm for the task of quantum clustering, where the goal is to group 
in clusters quantum states that are similar (using the fidelity as a similarity measure) while putting states 
that are dissimilar in different clusters. The model of learning on quantum states put forward in this paper 
is complementary to a model proposed by Aaronson [1], where the training dataset is composed of POVMs
(Positive Operator-Valued Measures), and not quantum states. In Aaronson's model, we receive a finite
number of copies of an unknown quantum state and the goal is, by "training" this state on a few POVMs,
to produce with high probability a hypothesis that can generalize with a reasonable accuracy on unobserved
POVMs belonging to this training dataset.

The outline of this paper is as follows. First, the model of performing learning in a quantum world is 
introduced in Section 2 along with the notion of learning reduction which allows us to relate together different 
learning tasks. Afterwards, in Section 3, the task of binary classification is described, and the weighted and
multiclass versions of this task are defined respectively in Sections 4 and 5. Finally, Section 6 concludes with
a discussion.

^1 Other common names include state distinguishability and state identification.

2 Learning in a quantum world 

Machine Learning (ML) [15, 22, 29] is the field that studies techniques to give to machines the ability to 
learn from past experience. Typical tasks in supervised learning include the ability to predict the class 
(classification) or some unobserved characteristic (regression) of an object based on some observations. In
unsupervised learning, the goal is to find some structure hidden within the data such as discovering "natural"
clusters (clustering), finding a meaningful low-dimensional representation of the data (dimensionality
reduction) or learning explicitly a probability function (also called density function) that represents the true
distribution of the data (density estimation). ML algorithms learn from a training dataset which contains
observations about objects, which are either obtained empirically or acquired from experts. 

2.1 Learning with a classical dataset 

In classical ML, the observations and the objects are implicitly considered to be classical and the machine
which performs the learning is assumed to be a classical computer (such as a classical Turing machine
or a classical logical circuit). For instance, in supervised learning, a training dataset containing $n$ data
points can be described as $D_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $x_i$ is a vector of observations on the
characteristics of the $i$-th object (or data point) and $y_i$ is the corresponding class of that object. As a typical
example, each object can be described using $d$ real-valued attributes (i.e. $x_i \in \mathbb{R}^d$) and, if we are dealing
with binary classification, by a label $y_i \in \{-1, +1\}$.

Example 1 (Classical classification tasks). Recognizing the fingerprints or the face of a person
(in this case each class corresponds to a person), automatically classifying a news article as belonging to the
"culture" or "sports" section, detecting fraud, classifying music by genre, etc.

The main difference between supervised and unsupervised learning is that in the latter case, the $y_i$ values
are unknown. This could mean that we know the possible labels in general but not the specific label of each 
data point, or that even the number of classes and their labels are unknown to us. 

2.2 Learning with a quantum dataset 

In a quantum world, an ML algorithm still needs a training dataset from which to perform learning, but this
dataset now contains quantum objects instead of classical observations on classical objects (the machine is
also a quantum computer).

Definition 1 (Quantum training dataset). A quantum training dataset containing $n$ pure quantum states
can be described as $D_n = \{(|\psi_1\rangle, y_1), \ldots, (|\psi_n\rangle, y_n)\}$, where $|\psi_i\rangle$ is the $i$-th quantum state of the training
dataset and $y_i$ is the class associated with this state.

Example 2 (Quantum training dataset composed of pure states defined on d qubits). In the context where
all the pure states in the training dataset live in a Hilbert space formed by $d$ qubits and we are interested in
the task of binary classification, $|\psi_i\rangle \in \mathbb{C}^{2^d}$ and $y_i \in \{-1, +1\}$.
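As a purely illustrative sketch (it is a classical simulation, not part of the formalism above), such a dataset can be represented in NumPy by storing each pure state on $d$ qubits as a normalized complex vector of length $2^d$, paired with a label in $\{-1, +1\}$. The helper names `random_pure_state` and `make_dataset`, and the choice of random Gaussian amplitudes, are assumptions made for this example only.

```python
import numpy as np

def random_pure_state(d, rng):
    """Return a random normalized state vector on d qubits (length 2**d)."""
    amplitudes = rng.normal(size=2**d) + 1j * rng.normal(size=2**d)
    return amplitudes / np.linalg.norm(amplitudes)

def make_dataset(n, d, seed=0):
    """Build a list of (state_vector, label) pairs with labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    return [(random_pure_state(d, rng), int(rng.choice([-1, 1])))
            for _ in range(n)]

# Hypothetical toy dataset: n = 4 states on d = 2 qubits.
dataset = make_dataset(n=4, d=2)
```

Note that holding the full amplitude vector already amounts to a classical description of each state, which is only feasible here because $d$ is tiny.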

In this work, we will restrict ourselves to the case where the states are quantum but the classes remain
classical. A further generalization would be to consider the situation in which objects can be in a quantum
superposition of classes^2. Another extension of the model is to allow the quantum states to be mixed, and not only
pure.

^2 Note that being in a quantum superposition of classes is not equivalent to the classical notion of a data point belonging to
several classes in a fuzzy or probabilistic manner.

2.3 Learning classes



One of the intrinsic difficulties of defining learning in the quantum world comes from the many ways in which
quantum states can be specified in the training dataset. For instance, the training dataset could contain a 
finite number of copies of each quantum state or consist of a classical description of these (like an explicit 
description of their density matrices). This latter case is the most "powerful" in the sense of information 
theory because from a classical description of a state, it is always possible in principle to produce as many 
quantum copies as desired. 

To formalize this notion, the concept of learning classes that differ in the form of the training dataset, 
the learner's technological sophistication and his learning goal was introduced in [2]. 

Definition 2 (Learning class). For a learning class $L_{goal}^{context}$, the subscript "goal" refers to the learning goal
and the superscript "context" to the form of the training dataset and/or the technology to which the learner has
access.

Possible values for goal are cl, which stands for doing ML with a classical purpose in mind, and qu for
ML with a quantum motivation. Similarly, the superscript context can be cl for "classical" if everything is
classical (with a possible exception for the goal) or qu if something "quantum" is going on in the learning
process. Other values for context can be used when we need to be more specific. For example, $L_{cl}^{cl}$ corresponds
to ML in the usual sense, in which we want to use classical means to learn from classical observations about
classical objects. Another example is $L_{cl}^{qu}$, in which we have access to a quantum computer to facilitate the
learning but the goal remains to perform a classical task: the quantum computer could serve to speed up
the learning process.

In this paper, we are only concerned with the specific case where "goal = qu".

Definition 3 (Quantum learning from the classical description of the quantum states). $L_{qu}^{cl}$ is defined as the
learning class in which we receive the classical descriptions of the quantum states from the training dataset
(i.e. $D_n = \{(\psi_1, y_1), \ldots, (\psi_n, y_n)\}$, where $\psi_i$ is the classical description of quantum state $|\psi_i\rangle$).

Learning becomes more challenging^3 when the dataset is available only in its quantum form, in which
case more copies make life easier as we can potentially extract more information on the state. For instance,
a corollary of the Holevo bound [19, 13] states that it is impossible to extract more than $d$ classical bits
of information from a quantum state living in a Hilbert space formed by $d$ qubits. Moreover, the no-cloning
theorem [30] forbids us to produce two identical copies of an unknown quantum state. Finally, some
tradeoffs exist between the amount of information that we can learn on a quantum state and the corresponding
perturbation that this process will generate (see [21] for instance).

Definition 4 (Quantum learning from a finite number of copies of the quantum states). $L_{qu}^{\otimes s}$ is defined
as the learning class in which we are given at least $s$ copies of each quantum state of the training dataset
(i.e. $D_n = \{(|\psi_1\rangle^{\otimes s}, y_1), \ldots, (|\psi_n\rangle^{\otimes s}, y_n)\}$, where $|\psi_i\rangle^{\otimes s}$ symbolizes $s$ copies of state $|\psi_i\rangle$).

Contrast these classes with ML in a classical world (such as $L_{cl}^{cl}$), in which additional copies of a particular
object are obviously useless as they do not carry new information. The main purpose of defining quantum 
learning classes is to be able to put some quantum training datasets and some learning tasks within them. 

The quantum learning classes form a hierarchy in an information-theoretic sense, where the higher a 
class is located inside the hierarchy, the more information it contains in order to realize tasks linked to the 
datasets belonging to this class. The class $L_{qu}^{cl}$ is at the top of the hierarchy since it corresponds to having a
complete knowledge about the quantum states forming the training set. Let $\equiv_\ell$, $\leq_\ell$ and $<_\ell$ be the operators
which denote respectively the equivalence, the weaker or equal and the strictly weaker relationships within
the hierarchy. The following propositions (first stated in [2]) describe some relations between the learning
classes forming the hierarchy.

^3 Remark however that the classical description of a state is generally exponentially longer to write if it is represented
classically as a string of bits compared to the corresponding quantum state in the form of qubits. Therefore, we can imagine
a paradoxical situation where to describe classically the $2^{1000}$ amplitudes of a quantum state defined on 1000 qubits, we would
need more memory than there are atoms in the universe, even if each atom could be used individually as a classical
unit of memory (i.e. a bit). By contrast, if we can coherently manipulate the atoms and maintain them in superposition, 1000
atoms would suffice to store the same state.

Proposition 1. $L_{qu}^{\otimes s} \equiv_\ell L_{qu}^{cl}$ as $s \to \infty$.

Proof. When the number of copies tends to infinity, it is always possible to estimate the state using quantum
tomography and reconstruct its classical description with arbitrary precision. □
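The intuition behind this proof can be illustrated with a minimal simulation, restricted to a single qubit as a simplifying assumption: each copy is measured in one of the three Pauli bases, and the empirical frequencies give an estimate of the Bloch vector whose error shrinks as the number of copies grows. The function names and the specific test state below are hypothetical choices for this sketch, not a prescription of how tomography should be done in general.

```python
import numpy as np

PAULIS = [
    np.array([[0, 1], [1, 0]], dtype=complex),     # X
    np.array([[0, -1j], [1j, 0]], dtype=complex),  # Y
    np.array([[1, 0], [0, -1]], dtype=complex),    # Z
]

def true_bloch_vector(psi):
    """Exact expectation values <X>, <Y>, <Z> of a qubit state vector."""
    return np.array([np.real(np.vdot(psi, p @ psi)) for p in PAULIS])

def estimate_bloch_vector(psi, copies_per_axis, rng):
    """Estimate the Bloch vector from simulated Pauli measurements."""
    estimate = []
    for pauli in PAULIS:
        expectation = np.real(np.vdot(psi, pauli @ psi))
        p_plus = min(max((1 + expectation) / 2, 0.0), 1.0)  # Prob(outcome +1)
        n_plus = rng.binomial(copies_per_axis, p_plus)      # simulated counts
        estimate.append(2 * n_plus / copies_per_axis - 1)
    return np.array(estimate)

rng = np.random.default_rng(1)
psi = np.array([np.cos(0.3), np.exp(0.7j) * np.sin(0.3)])
# Estimation error for an increasing number of copies per measurement basis.
errors = {
    s: np.linalg.norm(estimate_bloch_vector(psi, s, rng) - true_bloch_vector(psi))
    for s in (10, 1000, 100000)
}
```

Each simulated measurement consumes one copy, so the total number of copies spent is three times `copies_per_axis`.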

Proposition 2. $L_{qu}^{\otimes 1} \leq_\ell \ldots \leq_\ell L_{qu}^{\otimes s} \leq_\ell L_{qu}^{\otimes s+1} \leq_\ell \ldots \leq_\ell L_{qu}^{cl}$.

Proof. Each new copy of a state gives potentially more information on that state. Therefore for any positive
integer $s$, we have $L_{qu}^{\otimes s} \leq_\ell L_{qu}^{\otimes s+1}$, which implies that if a learning task $A \in L_{qu}^{\otimes s}$, it also belongs to $L_{qu}^{\otimes s+1}$.
Furthermore due to Proposition 1, a classical description of a state is as good as any number of copies. □

Proposition 3. $L_{qu}^{\otimes s} + L_{qu}^{\otimes 1} \leq_\ell L_{qu}^{\otimes s+1}$, where "+" denotes a restriction that the first $s$ copies must be measured
separately from the last.

Proof. Performing a joint measurement by allowing $s + 1$ copies to interact together can potentially give
more information than performing a joint measurement on $s$ copies plus a separate measurement on another
copy. (See [26] for a specific instance where $s = 1$ and [12] for results about arbitrary $s$.) □

An interesting open question is whether or not this hierarchy is strict. 

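Open question 1 (Strict hierarchy of learning classes). In the expression $L_{qu}^{\otimes 1} \leq_\ell \ldots \leq_\ell L_{qu}^{\otimes s} \leq_\ell L_{qu}^{\otimes s+1} \leq_\ell
\ldots \leq_\ell L_{qu}^{cl}$, can some of these $\leq_\ell$ be replaced by $<_\ell$?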

There are good reasons to believe that the answer is positive since it is usually the case that more information
can be obtained about a quantum state when more copies are available. Moreover, it has been proven
that in some situations joint measurements are more informative than individual measurements [26, 12].
However, it does not necessarily follow that this additional information can be used in a constructive manner 
to solve some learning tasks. 

2.4 Learning reduction 

The notion of reduction between learning tasks [8] was developed and formalized in recent years in
the context of classical ML by Langford and co-authors^4.

Definition 5 (Learning reduction [8]). A learning task A reduces to some other learning task B if by having 
access to a black-box (an oracle) that solves B, it is also possible to solve A. 

A learning reduction can be seen as an information-theoretic statement about how well it is possible to 
solve a particular learning task given an algorithm (modelled abstractly by an oracle) that can solve another 
task. Although in general it is desirable for this transformation to be efficient, learning reductions differ 
from the "traditional" reductions used in complexity theory (such as Turing or Karp reductions) in the sense 
they do not try to characterize the computational time needed to solve a particular task. Rather, learning 
reductions offer a way to compare and relate two different learning tasks in the sense of information-theory. 
If A reduces to B, it means that any progress made on how to solve B can be transferred directly to A by 
using the reduction. Moreover, if different tasks all reduce to a single learning primitive, it means that an 
improvement on this primitive has a direct impact on all the other tasks. For instance, in sections 4 and 5, 
we will see how to solve the weighted binary and the multiclass classification tasks given an oracle for solving
the standard binary classification (section 3). 

A good reduction often offers some guarantee that a good performance of the black-box in solving
problem B also implies a good performance regarding problem A. For instance in classification, this guarantee
could take the form of upper bounds on the error achieved by the final classifier. The upper bounds generally
relate the average error of the classifiers generated by the oracle on subproblems B to the global error that
the final combined classifier will make on the general problem A.

^4 See for instance the webpage of Langford's project on learning reductions: http://hunch.net/~jl/projects/reductions/reductions.html.

Definition 6 ((Training) error). The (training) error $\epsilon_f$ (or error rate) of a classifier $f$ is defined as the
probability that this classifier fails to predict the correct class $y_i$ of a quantum state $|\psi_i\rangle$ drawn randomly from the
states of the quantum training dataset $D_n$. Formally:

    $\epsilon_f = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i)$    (1)

This definition characterizes precisely the training error of the classifier but not its generalization error, 
which corresponds to how well the classifier predicts on states that it has not observed exactly beforehand 
(i.e. that are not part of the training dataset). For now, we will focus only on the minimization of this 
training error but we will come back to the generalization error (which is really the essence of ML) in the 
discussion (Section 6). 
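When classical descriptions of the states are available, the training error of Definition 6 can be evaluated numerically. The following minimal sketch assumes that the classifier is given as a two-outcome POVM, stored in a hypothetical dictionary `povm` mapping each label to its POVM element, and that the probability of each outcome follows the Born rule. The toy dataset and measurement are illustrative assumptions.

```python
import numpy as np

def training_error(dataset, povm):
    """Average probability that the POVM outputs the wrong label.

    dataset: list of (state_vector, label) pairs with label in {-1, +1}.
    povm: dict mapping each label to its POVM element; the two elements
          are assumed to be positive semidefinite and to sum to identity.
    """
    total = 0.0
    for psi, y in dataset:
        p_correct = np.real(np.vdot(psi, povm[y] @ psi))  # Born rule
        total += 1.0 - p_correct
    return total / len(dataset)

ket0 = np.array([1, 0], dtype=complex)
ket_plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
dataset = [(ket0, -1), (ket_plus, +1)]

# Computational-basis measurement: outcome 0 -> class -1, outcome 1 -> class +1.
povm = {
    -1: np.diag([1, 0]).astype(complex),
    +1: np.diag([0, 1]).astype(complex),
}
error = training_error(dataset, povm)
```

In this toy case the state labelled $-1$ is always classified correctly and the state labelled $+1$ is classified correctly half of the time, so the training error is 0.25.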

In the context of quantum classification, the notion of regret also takes a particular importance. 

Definition 7 (Regret). The regret $r_f$ of a classifier $f$ is defined as the difference between its error rate $\epsilon_f$
and the smallest error $\epsilon_{opt}$ that can be achieved on the same problem. Formally:

    $r_f = \epsilon_f - \epsilon_{opt}$    (2)

The regret of a classifier, as well as its error, can potentially take any value in the range between zero
and one. The concept of regret is particularly meaningful in the context of hard learning problems, where
the raw error rate alone is not an appropriate measure to characterize the inherent difficulty of the learning.
Indeed, in some learning situations, it is possible to observe a high error rate but a low (or even null) regret.
In the classical setting, a high error rate but a low regret is an indication of a high level of noise. The
situation is different in the quantum world where a high error rate might be due to the intrinsic physical
difficulty of distinguishing two classes, but does not necessarily imply a high level of noise. Regardless of the
context, if the regret of a classifier is zero, it essentially means that this classifier is optimal.

Quantumly, a reduction or a learning task may also have a cost associated with it. Indeed, each call to the
oracle may require sacrificing some copies of the quantum states due to the measurements performed by the
oracle during the training. This cost is measured in terms of the number of copies required individually for
each quantum state of the training dataset. Another way to define this cost would have been to count globally
the number of copies required relative to the size of the training dataset^5. The cost can be differentiated
between the number of copies needed during the training/learning phase, where we learn/build a POVM $f$
that acts as the classifier, and during the classification time (or testing phase) where we use $f$ to classify an
unknown quantum state $|\psi_?\rangle$.

Definition 8 (Training/learning cost). The training/learning cost of a reduction is equal to the number of
calls to the oracle made by the reduction, multiplied by the number of copies of each quantum state that are
used in each call. In the case of a learning task, the cost is directly characterized by the number of copies of
each state necessary to perform this task.

If we have a classical description of the quantum states (i.e. $D_n \in L_{qu}^{cl}$), the training/learning does not
cost anything in terms of information because we already have complete knowledge of the quantum states.

Definition 9 (Classification cost). The classification cost corresponds to the number of copies of the unknown
quantum state $|\psi_?\rangle$ that will be used by the classifier to predict the class $y_?$ of this state.

In the next three sections, we will define respectively the quantum analogues of three learning tasks:
binary classification, weighted binary classification and multiclass classification.

^5 Which is generally the same as multiplying the individual cost by a factor linear in $n$, the number of states in the training
dataset.

3 Binary classification



The task of binary classification consists in predicting the class $y_? \in \{-1, +1\}$ of an unknown quantum state
$|\psi_?\rangle$, given a single copy of this state^6. Formally, this learning task can be defined in the following manner.

(Quantum) binary classification:

Input: $D_n = \{(|\psi_1\rangle, y_1), \ldots, (|\psi_n\rangle, y_n)\}$, a quantum training dataset, where $|\psi_i\rangle \in \mathbb{C}^{2^d}$ and $y_i \in \{-1, +1\}$.
Output: A POVM acting as a binary classifier $f$ that can predict the class $y_?$ of an unknown quantum state
$|\psi_?\rangle$ given a single copy of this state.

Goal: Construct a binary classifier $f$ that minimizes the training error $\epsilon_f = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i)$.

A natural question to ask is what is the best probability of success we can hope for, or, equivalently,
the smallest error rate achievable. The easiest situation occurs when we have complete classical knowledge
of the quantum states which compose the training dataset ($D_n \in L_{qu}^{cl}$). However, even in this case, it is not
generally possible to devise a process that always correctly classifies any unknown state from a single copy
of this state. This remains true even if we know in advance that this state corresponds exactly to one of the
states in the training set^7. From the classical description of the states, it is possible to analytically build the
optimal POVM that minimizes the training error. Of course, it remains to be seen how such an approach 
would generalize when faced with a state which does not belong to the training set. This fundamental question 
will be briefly discussed in Section 6. 

Let $m_-$ be the number of quantum states in $D_n$ for which $y_i = -1$ (negative class), and its complement
$m_+$ be the number of states for which $y_i = +1$ (positive class), such that $m_- + m_+ = n$, the total number
of data points in $D_n$. Moreover, $p_-$ is the a priori probability of observing the negative class and is equal
to $p_- = \frac{m_-}{n}$, and $p_+ = \frac{m_+}{n}$ is its complementary probability for the positive class, such that $p_- + p_+ = 1$.

Definition 10 (Statistical mixture of the negative class). The statistical mixture representing the negative
class, $\rho_-$, is defined as $\rho_- = \frac{1}{m_-} \sum_{i=1}^{n} I\{y_i = -1\} |\psi_i\rangle\langle\psi_i|$, where $I\{\cdot\}$ is the indicator function which equals 1 if its
premise is true and 0 otherwise.

Definition 11 (Statistical mixture of the positive class). In the same manner, the statistical mixture
representing the positive class, $\rho_+$, is defined as $\rho_+ = \frac{1}{m_+} \sum_{i=1}^{n} I\{y_i = +1\} |\psi_i\rangle\langle\psi_i|$.

The problem of classifying an unknown state $|\psi_?\rangle$ drawn from the training set is equivalent to distinguishing
between the mixed states $\rho_-$ and $\rho_+$. Consider for instance the following scenario which illustrates this idea.
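The two class mixtures and their a priori probabilities can be computed directly from classical descriptions of the training states. The sketch below is a minimal NumPy illustration under the assumptions that both classes are non-empty and that each state is given as a normalized vector; the function name `class_mixtures` and the toy dataset are hypothetical.

```python
import numpy as np

def class_mixtures(dataset):
    """Return (p_minus, rho_minus, p_plus, rho_plus) for labelled pure states.

    dataset: list of (state_vector, label) pairs with label in {-1, +1};
             both classes are assumed to contain at least one state.
    """
    n = len(dataset)
    dim = len(dataset[0][0])
    sums = {-1: np.zeros((dim, dim), dtype=complex),
            +1: np.zeros((dim, dim), dtype=complex)}
    counts = {-1: 0, +1: 0}
    for psi, y in dataset:
        sums[y] += np.outer(psi, psi.conj())  # projector |psi><psi|
        counts[y] += 1
    rho_minus = sums[-1] / counts[-1]  # uniform mixture of the negative class
    rho_plus = sums[+1] / counts[+1]   # uniform mixture of the positive class
    return counts[-1] / n, rho_minus, counts[+1] / n, rho_plus

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)
ket_plus = np.array([1, 1], dtype=complex) / np.sqrt(2)
p_minus, rho_minus, p_plus, rho_plus = class_mixtures(
    [(ket0, -1), (ket1, -1), (ket_plus, +1)]
)
```

Here the negative class contains two orthogonal states, so its mixture is the maximally mixed qubit state, while the positive class is a single pure state.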

Scenario 1 (Preparation of the state of a class by a demon^8). Imagine a demon that sits inside a black-box
with a single button. Each time the button is pressed, the demon chooses at random between the negative and
positive class according to their a priori probabilities $p_-$ and $p_+$. Once the class is determined, the demon
chooses uniformly at random one of the states belonging to this class and prepares the corresponding state
(we suppose that the demon in its infinite power knows the classical description of the states and can prepare
perfectly any one of them). This state is returned as output by the black-box. Therefore, finding the class
of this state is essentially the same as guessing which class the demon^9 has chosen during the first step, but
not necessarily identifying the exact state.

The minimal error rate of this classification process is linked to the statistical overlap of the mixtures $\rho_-$
and $\rho_+$. This kind of problem has already been studied in quantum detection and estimation theory [17], a
field that predates quantum information processing. Some results from this field can be used to give bounds
on the best training error that quantum learning algorithms might reach.

^6 See however the work of Sasaki and Carlini [28] for the case where more than one copy of the unknown state $|\psi_?\rangle$ is available.

^7 Unless we are in the trivial situation where all the states are mutually orthogonal. In this case, a non-destructive measurement
in a basis formed by these states will reveal the state without perturbing it.

^8 This scenario could be reformulated by replacing the demon by a probabilistic algorithm. This raises the question of how
much classical memory the algorithm will need to remember the description of the states.

^9 Here the role of the demon is simply to prepare the state, and not to act as an adversary which tries to fool the learner
who is outside the box.

Theorem 1 (Helstrom measurement [17]). The error rate of distinguishing between the two classes $\rho_-$ and
$\rho_+$ is bounded from below by $\epsilon_{hel} = \frac{1}{2} - \frac{D(\rho_-, \rho_+)}{2}$, where $D(\rho_-, \rho_+) = \mathrm{Tr}|p_- \rho_- - p_+ \rho_+|$ is a distance measure
between $\rho_-$ and $\rho_+$ called the trace distance (here, $p_-$ and $p_+$ represent the a priori probabilities of classes
$\rho_-$ and $\rho_+$, respectively). Moreover, this bound can be achieved exactly by the optimal POVM called the
Helstrom measurement.

Corollary 1 (Regret of Helstrom measurement). The Helstrom measurement is a binary classifier that
has a null regret, which means $r_{hel} = 0$.

Proof. The null regret of the Helstrom measurement follows directly from the optimality of this POVM to 
distinguish between the two classes. □ 

Remark 1 (Error rate of the Helstrom measurement for equiprobable classes). Consider the case where
both the negative class and the positive class are equiprobable. If $\rho_-$ and $\rho_+$ are two density matrices which
correspond to the same state, their trace distance $D(\rho_-, \rho_+)$ is equal to zero, which means that the error
$\epsilon_{hel}$ of the Helstrom measurement is $\frac{1}{2}$. On the other hand, if $\rho_-$ and $\rho_+$ are orthogonal, this means that
$D(\rho_-, \rho_+) = 1$ and that the Helstrom measurement has an error $\epsilon_{hel} = 0$.
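The two extreme cases of this remark, as well as intermediate ones, can be checked numerically. The sketch below assumes classical descriptions of the two density matrices and uses the standard construction in which the Helstrom measurement projects onto the positive eigenspace of the operator $p_+ \rho_+ - p_- \rho_-$; the function name `helstrom` is a hypothetical choice, and nothing here addresses whether the measurement can be implemented efficiently by a circuit.

```python
import numpy as np

def helstrom(p_minus, rho_minus, p_plus, rho_plus):
    """Return (error, E_minus, E_plus) for the Helstrom measurement."""
    gamma = p_plus * rho_plus - p_minus * rho_minus    # Hermitian operator
    eigvals, eigvecs = np.linalg.eigh(gamma)
    positive = eigvecs[:, eigvals > 0]
    e_plus = positive @ positive.conj().T              # predict class +1
    e_minus = np.eye(gamma.shape[0]) - e_plus          # predict class -1
    error = 0.5 - 0.5 * np.sum(np.abs(eigvals))        # 1/2 - D/2, D = Tr|gamma|
    return error, e_minus, e_plus

ket0 = np.array([1, 0], dtype=complex)
ket1 = np.array([0, 1], dtype=complex)
ket_plus = (ket0 + ket1) / np.sqrt(2)
rho0 = np.outer(ket0, ket0.conj())
rho1 = np.outer(ket1, ket1.conj())
rho_p = np.outer(ket_plus, ket_plus.conj())

err_same, _, _ = helstrom(0.5, rho0, 0.5, rho0)            # identical classes
err_orth, _, _ = helstrom(0.5, rho0, 0.5, rho1)            # orthogonal classes
err_mid, e_minus, e_plus = helstrom(0.5, rho0, 0.5, rho_p)  # overlapping classes
```

For identical classes the error is one half, for orthogonal classes it vanishes, and for the overlapping pair it lies strictly in between.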

The purpose of a learning algorithm in the quantum setting is to give a constructive way to come close to
(or to achieve) the Helstrom bound. If we know the classical description of the quantum states, it corresponds
to finding an efficient implementation of the Helstrom measurement. If $D_n \in L_{qu}^{\otimes s}$, the learning becomes
more challenging and it is difficult to characterize the exact relationship between the number $s$ of copies of
each training state that are available, the dimension $d$ of the Hilbert space in which the quantum states live
and the minimal error rate $\epsilon$ we can hope to reach. Contrary to classical ML, where it is always possible (but
not recommended in terms of generalization) to bring the training error down to zero (for instance using a 
memory-based classifier such as 1-nearest neighbour), the situation is different in the quantum context as 
expressed by the following lemma. 

Lemma 1. It is impossible to reach a training error of zero in the quantum case from a single copy of an
unknown quantum state unless all the states of the training dataset are mutually orthogonal.

Proof. From Theorem 1 and Remark 1, it is easy to see that it is impossible to construct a POVM that
perfectly classifies a quantum state drawn from the training set $D_n$, unless all the states of the ensemble
are mutually orthogonal, or equivalently that the distance between the two density matrices of the classes is
$D(\rho_-, \rho_+) = 1$. □

Given a finite number of copies of each state of the training set, the possible learning strategies include: 

(1) the estimation of the training set by making measurements (joint or not) on some of the copies to 
construct a POVM that will differentiate between the two classes, 

(2) the design of a classification mechanism that uses the copies only when the time of classifying an 
unknown quantum state comes or 

(3) any hybrid strategy between (1) and (2). 

For the classification, several measurement strategies exist in the quantum context such as: 

(a) maximizing the probability of predicting the class of an unknown quantum state (which corresponds to 
the Helstrom measurement [17]), 

(b) minimizing the probability of making a wrong guess. This strategy is called unambiguous discrimination
[18] and is possible only when the states of $D_n$ are linearly independent. In this specific case, it
is possible to design a measurement that is allowed to sometimes answer "I don't know", but when it
makes a prediction regarding one of the classes we can be 100% confident that its prediction is correct.

(c) any strategy between these two extremes (a) and (b). A confidence-based measurement^10 [14] is a
measurement that can identify the class of a state with some (known) confidence, or answer "I don't
know" the rest of the time. For a fixed chosen confidence, the main objective when we build such
a measurement is to minimize the probability that it outputs "I don't know". When the confidence is
fixed at 100% this directly corresponds to unambiguous discrimination, whereas if an inconclusive
answer is not allowed it corresponds to the Helstrom measurement. It is sometimes possible to design
a confidence-based measurement (with a confidence greater than the Helstrom measurement) even
when perfect unambiguous discrimination is impossible (for instance if the states of $D_n$ are linearly
dependent).
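The trade-off between strategies (a) and (b) can be made concrete in the simplest possible case, which is an assumption of this sketch: two equiprobable classes that each contain a single pure state. In that case the Helstrom error is $\frac{1}{2}(1 - \sqrt{1 - |\langle\psi_1|\psi_2\rangle|^2})$, and the smallest probability of answering "I don't know" in unambiguous discrimination is $|\langle\psi_1|\psi_2\rangle|$ (the well-known Ivanovic-Dieks-Peres limit). The function names below are hypothetical.

```python
import numpy as np

def helstrom_error_pure(psi1, psi2):
    """Minimum error for two equiprobable pure states (strategy (a))."""
    overlap = abs(np.vdot(psi1, psi2))
    return 0.5 * (1 - np.sqrt(max(0.0, 1 - overlap**2)))

def unambiguous_inconclusive_prob(psi1, psi2):
    """Smallest "I don't know" probability for two equiprobable pure
    states under error-free (unambiguous) discrimination (strategy (b))."""
    return abs(np.vdot(psi1, psi2))

ket0 = np.array([1, 0], dtype=complex)
ket_plus = np.array([1, 1], dtype=complex) / np.sqrt(2)

p_err = helstrom_error_pure(ket0, ket_plus)
p_unknown = unambiguous_inconclusive_prob(ket0, ket_plus)
```

For this pair of states, strategy (a) always answers but errs with probability of about 0.146, whereas strategy (b) never errs but is inconclusive with probability of about 0.707.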

In this paper, we will focus only (with the exception of Section 5.1) on the measurement strategy of maximizing
the probability of correctly identifying the class of a state (measurement strategy (a)) by learning
from the training dataset a POVM that can act as a classifier (learning strategy (1)). We will make the
assumption that we have access to an oracle, called the Helstrom oracle, that can efficiently solve the task
of binary classification.

Definition 12 (Helstrom oracle). The Helstrom oracle is an abstract construction that takes as input:

Version 1: a classical description of the density matrices $\rho_-$ and $\rho_+$ and their a priori probabilities $p_-$
and $p_+$ (learning class $L_{qu}^{cl}$) or

Version 2: a finite number of copies of each state of the quantum training dataset $D_n$ (learning class
$L_{qu}^{\otimes \Theta(t_{hel})}$).

From this input, the oracle can be "trained" to produce an efficient implementation (exact or approximate)
of the POVM of the Helstrom measurement $f_{hel}$, in the form of a quantum circuit that can distinguish
between $\rho_-$ and $\rho_+$. In the second version of the oracle, its training cost $t_{hel}$ corresponds to the minimum
number of copies of each state of the training dataset that the oracle has to sacrifice in order to construct
$f_{hel}$.

One fundamental question deals with the (non-)existence of an efficient implementation for the Helstrom 
measurement. 

Open question 2 (Efficient implementation of the Helstrom measurement). What are the learning situations 
(i.e. the ensembles of quantum states) for which it is possible to implement efficiently (for instance with a 
polynomial-size circuit) an approximate version of the Helstrom measurement? 

There is no a priori guarantee that the description of the POVM which corresponds to the Helstrom 
measurement can be physically realized by a quantum circuit whose size is polynomial in the number of 
input qubits. Indeed in the worst case, it could happen that this circuit requires a number of gates that is 
exponential in its input size, and this even for its approximate version. 

By assuming the existence of the Helstrom oracle, we deliberately avoid the burden of describing explicitly 
how the learning algorithm, which acts as the oracle in practice, works (and how many quantum states it 
requires for the learning process). Designing a learning algorithm that can solve the binary classification 
task in practice is a fundamental open question. 

Open question 3 (Construction of a learning algorithm implementing the Helstrom oracle). Is it possible to
design a learning algorithm that implements explicitly the Helstrom oracle? If so, what would be the value of
$t_{hel}$, the minimum number of copies of each training state that this algorithm requires during the learning?

This is a fundamental question in its own right, but here we focus instead on which tasks could be solved if we had
access to such an oracle. If we know a learning algorithm which has a low (albeit not optimal) error rate,
it is possible to use it instead of the Helstrom oracle in almost all the reductions described in this paper.

The original term is maximum-confidence measurement.

Suppose that we have a binary classifier $f$ that can predict the class of an unknown quantum state $|\psi_?\rangle$
with an error $\epsilon$, for $\epsilon < \frac{1}{2}$. If we have access to a constant number of copies of $|\psi_?\rangle$, we can simply repeat
the application of this classifier and output the majority of its predictions. By a standard Chernoff argument,
this decreases the probability of making an error exponentially fast in the number of copies spent.
This is true in the quantum world due to the inherently probabilistic nature of the measurement process. In
classical ML, the situation is different, as classifiers generally behave in a deterministic manner, meaning that
they will always predict the same outcome when we present them with the same data point.
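
The amplification argument can be checked numerically. The following is a minimal sketch, assuming that every run of the classifier errs independently with the same probability and that an odd number of copies is used so that ties cannot occur; the function name `majority_error` is illustrative and does not come from the paper.

```python
from math import comb


def majority_error(eps: float, copies: int) -> float:
    """Probability that a majority vote over `copies` independent runs is wrong.

    Each run is assumed to err independently with probability `eps`;
    `copies` must be odd so that the vote can never be tied.
    """
    if copies % 2 == 0:
        raise ValueError("use an odd number of copies to avoid ties")
    need = copies // 2 + 1  # number of wrong votes needed to flip the majority
    return sum(
        comb(copies, j) * eps**j * (1 - eps) ** (copies - j)
        for j in range(need, copies + 1)
    )


if __name__ == "__main__":
    # The error shrinks quickly as more copies are spent.
    for m in (1, 5, 15, 45):
        print(m, majority_error(0.3, m))
```

With a single copy the function returns the error of the base classifier, and the value decreases as the number of copies grows, in line with the Chernoff bound.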

4 Weighted Binary Classification 

The weighted binary classification task is similar to the standard binary case, except that each data
point now has an associated weight $w$ that indicates the importance of correctly classifying this state. This
weight can represent, for instance, a penalty that we have to pay if we predict the wrong class for this object.
If $w = \frac{1}{n}$ for each state, then this corresponds to the standard binary classification.

(Quantum) weighted binary classification:

Input: $D_n = \{(|\psi_1\rangle, y_1, w_1), \ldots, (|\psi_n\rangle, y_n, w_n)\}$, a quantum training dataset, where each $|\psi_i\rangle$ is a pure quantum state, $y_i \in \{-1, +1\}$ and $w_i \in [0, +\infty)$.

Output: A POVM acting as a binary classifier $f$ that can predict the class $y_?$ of an unknown quantum state $|\psi_?\rangle$.

Goal: Construct a binary classifier $f$ that minimizes the weighted training error rate $\epsilon_f = \sum_{i=1}^{n} w_i \, \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i)$.

Once again, if we are in the idealized situation where we know the classical descriptions of the states
(learning class $L_{qu}^{cl}$), their weights can be directly incorporated in the description of the density matrices of
their classes. In this scenario, the following reduction formalizes how to solve the weighted binary classification
task given access to a Helstrom oracle (version (1)).

Reduction 1 (Reduction from weighted binary classification to standard binary classification (via Helstrom
oracle)). Given access to a Helstrom oracle that takes as inputs the description of the density matrices
$\rho_-$ and $\rho_+$ (and their a priori probabilities $p_-$ and $p_+$), it is possible to reduce the task of weighted binary
classification to the task of standard binary classification.
Training cost: null.
Classification cost: $\Theta(1)$.

Proof. The weight $w_i$ of a particular state can be converted to a probability $p_i$ reflecting its importance by
setting

$$p_i = \frac{w_i}{\sum_{j=1}^{n} w_j} \qquad (3)$$

Let $p_-$ be the new a priori probability of the negative class, which is equal to

$$p_- = \sum_{i=1}^{n} p_i \, I\{y_i = -1\} \qquad (4)$$

and $p_+$ its complementary probability, such that $p_- + p_+ = 1$. Theorem 2 demonstrates that the Helstrom
measurement which discriminates between the density matrices in which the weights are incorporated is
precisely the POVM which minimizes the weighted error. Therefore, it suffices to call the Helstrom oracle
with inputs

$$\rho_- = \sum_{i=1}^{n} p_i \, I\{y_i = -1\} \, |\psi_i\rangle\langle\psi_i| \qquad (5)$$

and

$$\rho_+ = \sum_{i=1}^{n} p_i \, I\{y_i = +1\} \, |\psi_i\rangle\langle\psi_i| \qquad (6)$$

(with a priori probabilities $p_-$ and $p_+$). This reduction makes only one call to the Helstrom oracle and requires
only one copy of the unknown quantum state at classification. □
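
To make the construction concrete, here is a minimal sketch restricted to qubit states whose classical description is known. It normalizes the weights into probabilities, accumulates the two weighted class operators, and evaluates the minimum discrimination error through the standard identity $\min(\mathrm{Tr}(\Pi_-\rho_+) + \mathrm{Tr}(\Pi_+\rho_-)) = \frac{1}{2}(1 - \|\rho_+ - \rho_-\|_1)$, which holds because the two operators have total trace one. The function names and the representation of a state as a pair of amplitudes are assumptions made for illustration only.

```python
from math import sqrt


def weighted_operators(states, labels, weights):
    """Weighted class operators for qubit states given as amplitude pairs (a, b)."""
    total = sum(weights)
    rho = {-1: [[0j, 0j], [0j, 0j]], +1: [[0j, 0j], [0j, 0j]]}
    for (a, b), y, w in zip(states, labels, weights):
        norm = sqrt(abs(a) ** 2 + abs(b) ** 2)
        amp = (a / norm, b / norm)
        p_i = w / total  # weight converted into a probability
        for r in range(2):
            for c in range(2):
                rho[y][r][c] += p_i * amp[r] * amp[c].conjugate()
    return rho[-1], rho[+1]


def helstrom_error(rho_minus, rho_plus):
    """Minimum error (1 - ||rho_plus - rho_minus||_1) / 2 for 2x2 Hermitian inputs."""
    a = (rho_plus[0][0] - rho_minus[0][0]).real
    c = (rho_plus[1][1] - rho_minus[1][1]).real
    b = rho_plus[0][1] - rho_minus[0][1]
    # Eigenvalues of a 2x2 Hermitian matrix [[a, b], [conj(b), c]].
    mean = (a + c) / 2
    radius = sqrt(((a - c) / 2) ** 2 + abs(b) ** 2)
    trace_norm = abs(mean + radius) + abs(mean - radius)
    return (1 - trace_norm) / 2
```

For two orthogonal states the error is zero whatever the weights, whereas for two identical states carrying different labels the error equals the smaller of the two class probabilities, as expected from an optimal guess.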

Theorem 2 (Helstrom measurement minimizing the weighted error). The Helstrom measurement which
minimizes the training error between $\rho_- = \sum_{i=1}^{n} p_i \, I\{y_i = -1\} \, |\psi_i\rangle\langle\psi_i|$ and $\rho_+ = \sum_{i=1}^{n} p_i \, I\{y_i = +1\} \, |\psi_i\rangle\langle\psi_i|$
(with a priori probabilities $p_-$ and $p_+$) is also the POVM which minimizes the weighted classification error
on the quantum training dataset $D_n = \{(|\psi_1\rangle, y_1, w_1), \ldots, (|\psi_n\rangle, y_n, w_n)\}$.

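Proof. The Helstrom measurement is the POVM $f$ that minimizes the discrimination error between $\rho_-$
and $\rho_+$. This POVM can be decomposed into two elements $\Pi_-$ and $\Pi_+$, which both correspond to positive
semi-definite matrices such that $\Pi_- + \Pi_+ = I$, where $I$ is the identity matrix. Therefore, we have:

$$\epsilon_{Hel} = \min \left( \mathrm{Tr}(\Pi_- \rho_+) + \mathrm{Tr}(\Pi_+ \rho_-) \right) \qquad (7)$$

which can also be expressed as

$$\epsilon_{Hel} = \min \Big( \sum_{i=1}^{n} p_i \, I\{y_i = +1\} \, \mathrm{Tr}(\Pi_- |\psi_i\rangle\langle\psi_i|) \; + \qquad (8)$$

$$\sum_{i=1}^{n} p_i \, I\{y_i = -1\} \, \mathrm{Tr}(\Pi_+ |\psi_i\rangle\langle\psi_i|) \Big) \qquad (9)$$

and which simplifies to

$$\epsilon_{Hel} = \min \left( \sum_{i=1}^{n} p_i \, \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i) \right) \qquad (10)$$

because $\mathrm{Tr}(\Pi_- |\psi_i\rangle\langle\psi_i|)$ is the probability that $f$ predicts the negative class for $|\psi_i\rangle$ (and similarly for $\Pi_+$).
This is the same as minimizing the weighted training error:

$$\epsilon_{opt} = \min \left( \sum_{i=1}^{n} \frac{w_i}{\sum_{j=1}^{n} w_j} \, \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i) \right) \qquad (11)$$

$$\epsilon_{opt} = \min \left( \sum_{i=1}^{n} p_i \, \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i) \right) \qquad (12)$$

Since $\sum_{j=1}^{n} w_j$ is a positive constant that does not depend on $f$, the same POVM also attains

$$\min \left( \sum_{i=1}^{n} w_i \, \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i) \right) \qquad (13)$$

As this POVM is optimal, it automatically implies that its regret is zero. □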

In the case where only a finite number of copies of each quantum state is accessible, but we know a way of
producing an efficient binary classifier (such as the Helstrom oracle, version 2), the costing reduction [31]
makes it possible to reduce weighted binary classification to standard binary classification. This reduction proceeds
via a rejection sampling mechanism (Algorithm 1) and the aggregation of several classifiers (Algorithm 2),
and generates an ensemble of $T$ binary classifiers, where $T$ is a small constant chosen independently from the training dataset.

The output of the final classifier is simply a majority vote on the outputs of the individual classifiers.
The number of copies of the unknown state $|\psi_?\rangle$ used by the final classifier is a constant $\Theta(T)$, corresponding
to the number of binary classifiers forming the aggregated classifier (Algorithm 3). The more
evaluations are done, the more accurate the classification will be, but the more copies of $|\psi_?\rangle$ will be needed.

Algorithm 1 rejection_sampling($D_n \in L_{qu}^{\otimes \Theta(t_{bin})}$)
  Choose a constant $c$ greater than any weight $w$
  for each state do
    Flip a coin that lands on "tails" with probability $\frac{w}{c}$
    if the result is "tails" then
      Keep the copies of the state
    else
      Put them aside
    end if
  end for
  Return the newly generated distribution $D$



Algorithm 2 costing_training($D_n \in L_{qu}^{\otimes \Theta(T t_{bin})}$)
  for $j = 1$ to $T$ do
    Call rejection_sampling($D_n \in L_{qu}^{\otimes \Theta(t_{bin})}$) to obtain $D_j$
    Call the Helstrom oracle on $D_j$ to learn the binary classifier $f_j$
  end for
  Return the final classifier $f = \mathrm{majority}(f_1, \ldots, f_T)$



Reduction 2 (Reduction from weighted binary classification to standard binary classification (via costing [31])).
Given access to a Helstrom oracle (version 2) and a quantum training dataset $D_n \in L_{qu}^{\otimes \Theta(T t_{bin})}$, it
is possible to reduce the task of weighted binary classification to the task of standard binary classification.
Training cost: $\Theta(T t_{bin})$.
Classification cost: $\Theta(T)$.

Proof. During the training, the algorithm costing_training calls the Helstrom oracle $T$ times, for a constant
$T$ chosen independently from the training dataset. The training cost is therefore $\Theta(T t_{bin})$, which corresponds
to the number of calls to the Helstrom oracle multiplied by $t_{bin}$, the number of copies of each state
required at each call. As each call to the Helstrom oracle produces a classifier, the classification cost is
$\Theta(T)$, since one copy of the unknown state $|\psi_?\rangle$ is used for each generated classifier. The analysis
of the costing reduction [31] demonstrates that minimizing the average of the standard training errors of
the individual classifiers $f_1, \ldots, f_T$ on the distributions $D_1, \ldots, D_T$ is the same as indirectly minimizing the
weighted training error of the global classifier $f$, which means:

$$\epsilon_f \approx \min \sum_{i=1}^{n} w_i \, \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i) \qquad (14)$$

□



Algorithm 3 costing_classification($f = (f_1, \ldots, f_T)$)
  for $j = 1$ to $T$ do
    Measure $y_j = f_j(|\psi_?\rangle)$
  end for
  Return $y_? = \mathrm{majority}(y_1, \ldots, y_T)$





The quantum version of rejection sampling (Algorithm 1) has the additional benefit of "saving" some
copies of the quantum states during the generation of the distribution biased according to their weights.
Indeed, the states having a low weight have a higher probability of not being kept in the newly generated
distribution. Therefore, these states can be put aside and used later, for instance during another step of
rejection sampling.
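
The costing procedure can be sketched classically as follows. This is only an illustration under stated assumptions: states are opaque Python objects, `train_binary` is a hypothetical stand-in for a call to the Helstrom oracle, labels are in {-1, +1}, and `T` is odd so that the majority vote cannot be tied. By default the constant `c` is set to the largest weight, which keeps the heaviest examples with certainty; the algorithm in the text asks for a strictly larger constant, which can be passed explicitly.

```python
import random


def rejection_sampling(dataset, c, rng):
    """Keep each (state, label, weight) example with probability weight / c."""
    kept, aside = [], []
    for state, label, weight in dataset:
        if rng.random() < weight / c:
            kept.append((state, label))
        else:
            aside.append((state, label, weight))  # copies saved for later use
    return kept, aside


def costing_training(dataset, T, train_binary, c=None, seed=0):
    """Train T binary classifiers on T independently resampled datasets."""
    rng = random.Random(seed)
    if c is None:
        c = max(weight for _, _, weight in dataset)
    if c <= 0:
        raise ValueError("at least one weight must be positive")
    return [train_binary(rejection_sampling(dataset, c, rng)[0]) for _ in range(T)]


def costing_classification(classifiers, state):
    """Majority vote over the T classifiers (one copy of the state per vote)."""
    votes = sum(f(state) for f in classifiers)
    return 1 if votes > 0 else -1
```

The list `aside` returned by `rejection_sampling` corresponds to the copies that are put aside and remain available for a later round.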

5 Multiclass classification 

In the multiclass version of classification, each state is labeled after a class chosen among $k$ possible ones,
for $k > 2$. The goal is to build a classifier $f$ which, given a finite number of copies of an unknown state $|\psi_?\rangle$,
can predict its class $y_?$ with good accuracy.

(Quantum) multiclass classification:

Input: $D_n = \{(|\psi_1\rangle, y_1), \ldots, (|\psi_n\rangle, y_n)\}$, a quantum training dataset, where each $|\psi_i\rangle$ is a pure quantum state and $y_i \in \{1, \ldots, k\}$.

Output: A POVM acting as a multiclass classifier $f$ that can predict the class $y_?$ of an unknown quantum
state $|\psi_?\rangle$.

Goal: Construct a multiclass classifier $f$ that minimizes the training error rate $\epsilon_f = \frac{1}{n} \sum_{i=1}^{n} \mathrm{Prob}(f(|\psi_i\rangle) \neq y_i)$.

Moving from the binary to the multiclass case is far from trivial, and very little is known for the
case where the number of classes is $k > 2$. In particular, even for three classes, the exact form of the optimal
POVM that can distinguish between these three classes given a single copy of a state is not known. However,
we will see in Section 5.4 that if we know the classical description of the states, it is possible to design a
measurement (called the Pretty Good Measurement [16]) whose error is bounded by the square root of the error
of the optimal POVM.

The following sections describe different training and classification strategies for the cases where we have
access to a number of copies of the unknown quantum state $|\psi_?\rangle$ to classify which is:

- linear in $n$, the number of states in $D_n$ (Section 5.1).

- linear in $k$, the number of classes in $D_n$ (Section 5.2).

- logarithmic in $k$ (Section 5.3).

- a single copy or possibly a constant number of them (Section 5.4).

5.1 Classification via state identification

The most direct way of recognizing the class of a state is to identify this state exactly. Once the state is
identified, its class can be recovered directly (unless two or more states
are identical but labeled with different classes). If $\Theta(n)$ copies of the unknown quantum state $|\psi_?\rangle$ are
available, the Control-Swap test [3, 11] can be used between this state and each state of the training
dataset $D_n \in L_{qu}^{\otimes \Theta(1)}$. This method does not require any effort during the training, all the work being done
at classification time (therefore it corresponds to a learning strategy of type (2), Section 3). This learning
strategy can be seen as the quantum analogue of the one-nearest-neighbour classifier. Indeed, among the states of the
training dataset, we search for the one that is the closest to, or the most similar to (in the sense of fidelity), the
unknown quantum state. Unless there are two quantum states in $D_n$ that are identical but labeled with two
different classes, this method is guaranteed to have a null classification error (and therefore a null regret).
The following algorithm formalizes this method.



Algorithm 4 classification_via_identification($|\psi_?\rangle^{\otimes \Theta(n)}$, $D_n \in L_{qu}^{\otimes \Theta(1)}$)
  for $i = 1$ to $n$ do
    Measure the fidelity between $|\psi_?\rangle$ and $|\psi_i\rangle$ by using the Control-Swap test, which gives an estimate of $\mathrm{Fid}(|\psi_?\rangle, |\psi_i\rangle)$
  end for
  Return the class $y_j$ of the state $|\psi_j\rangle$ whose fidelity with the unknown quantum state is maximal: $j = \arg\max_j \mathrm{Fid}(|\psi_?\rangle, |\psi_j\rangle)$

Theorem 3 (Classification via state identification). The algorithm classification_via_identification classifies
an unknown quantum state $|\psi_?\rangle$ with a null classification error given $\Theta(n)$ copies of this state and $\Theta(1)$
copies of each state of $D_n$.

Proof. Each Control-Swap test requires a constant number of copies, and as we estimate the similarity between
$|\psi_?\rangle$ and all the $n$ states of $D_n$, the global cost of classification_via_identification will be $\Theta(n)$ copies of the
unknown state and $\Theta(1)$ copies of each state of the training dataset. Moreover, if there are no two states in $D_n$ that
are identical but labeled with two different classes, the algorithm is guaranteed to obtain a null classification
error (which implies a null regret). □
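
The identification strategy can be simulated classically when the amplitudes are known. The sketch below relies on the fact that a Control-Swap test between two pure states outputs 1 with probability $(1 - \mathrm{Fid})/2$, where the fidelity is taken to be the squared overlap. The function names and the number of repetitions `shots` per pair are assumptions made for the illustration, not values prescribed by the paper.

```python
import random


def fidelity(psi, phi):
    """Fidelity |<psi|phi>|^2 between two normalized amplitude vectors."""
    overlap = sum(a.conjugate() * b for a, b in zip(psi, phi))
    return abs(overlap) ** 2


def swap_test_estimate(psi, phi, shots, rng):
    """Simulate `shots` Control-Swap tests and return the fidelity estimate."""
    p_one = (1 - fidelity(psi, phi)) / 2  # probability of observing outcome 1
    ones = sum(rng.random() < p_one for _ in range(shots))
    return 1 - 2 * ones / shots


def classify_via_identification(unknown, dataset, shots=200, seed=0):
    """Return the label of the training state with the largest estimated fidelity."""
    rng = random.Random(seed)
    estimates = [swap_test_estimate(unknown, psi, shots, rng) for psi, _ in dataset]
    best = max(range(len(dataset)), key=estimates.__getitem__)
    return dataset[best][1]
```

When the unknown state coincides with a training state, the test against that state never outputs 1, so its estimate is exactly 1 and the corresponding class is returned.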

If we want to base the prediction of the class of $|\psi_?\rangle$ on its $k$ nearest neighbours instead of only its nearest
neighbour, Algorithm 4 can easily be adapted to base its prediction on a majority vote of their classes (the
training and classification costs remain unchanged). An interesting avenue of research is to design a quantum
equivalent of the classical data structures that facilitate the search for nearest neighbours, such
as kd-trees [5]. Quantumly, the main purpose of such a structure would be to retrieve
the nearest neighbours of an unknown state by consuming fewer copies than required by the direct naive
method (for instance by using a number of copies logarithmic in $n$ and linear in $c$, the number of neighbours
considered). If we do not know the classical description of the states, the construction of this data structure
may have a non-negligible training cost.

5.2 One-against-all reductions 



Algorithm 5 one_against_all_training($D_n \in L_{qu}^{\otimes \Theta(k t_{bin})}$)
  for $j = 1$ to $k$ do
    Initialize $D^{(j)}$ as the empty dataset
    for $i = 1$ to $n$ do
      Add the example ($|\psi_i\rangle^{\otimes \Theta(t_{bin})}$, $1 - 2 I\{y_i = j\}$) to $D^{(j)}$
    end for
    Call the Helstrom oracle on the dataset $D^{(j)}$ to learn a binary classifier $f_j$ that discriminates between
    the class $j$ and the union of all the other classes
  end for
  Return the ensemble of binary classifiers $f_1, \ldots, f_k$



The main idea of the one-against-all reduction [27] is to train a binary classifier for each of the $k$ classes.
Each of these binary classifiers discriminates between its own class and the union of all the other classes. This
reduction can be adapted in a straightforward manner to the quantum context by constructing, for each class,
a POVM acting as a binary classifier, which discriminates between the density matrix of this class and the
statistical mixture composed of the density matrices of the other classes. We will say that a classifier "clicks"
if it predicts that the unknown state $|\psi_?\rangle$ belongs to its own class, and that it "does not click" otherwise.
Given access to a Helstrom oracle, it is possible to reduce multiclass classification to the standard
binary case by using the following training and classification algorithms (Algorithms 5 and 6).

Reduction 3 (Reduction from multiclass classification to standard binary classification (via one-against-all)).
Given access to a Helstrom oracle and a quantum training dataset $D_n \in L_{qu}^{\otimes \Theta(k t_{bin})}$, it is possible
to reduce the multiclass classification task to the standard binary classification via a one-against-all reduction.

Training cost: $\Theta(k t_{bin})$.
Classification cost: $\Theta(k)$.

Algorithm 6 one_against_all_classification($|\psi_?\rangle^{\otimes k}$)
  for $j = 1$ to $k$ do
    Apply the binary classifier $f_j$ on $|\psi_?\rangle$ to obtain the prediction of whether or not this state belongs to the
    class $j$
  end for
  if only one classifier "has clicked" then
    Return the class associated with the classifier which has "clicked"
  else
    if several classifiers have "clicked" then
      Return a class chosen at random among all the classifiers which have "clicked"
    else
      Return a class chosen uniformly at random among the $k$ classes
    end if
  end if

Proof. The algorithm one_against_all_training calls the Helstrom oracle a number of times which is linear in
the number of classes $k$, and each call consumes a number of copies of each state of $D_n$ in $\Theta(t_{bin})$. Therefore,
the training cost of this reduction is $\Theta(k t_{bin})$. Regarding the classification, we need to sacrifice a copy of the
unknown state $|\psi_?\rangle$ for each of the $k$ binary classifiers generated, which leads to a total cost of $\Theta(k)$.

Regarding the analysis of the error of this reduction, let $\epsilon_j$ be the error of the classifier of class $j$. The
worst situation that can happen is that the classifier of the "good class" does not click (which corresponds
to a false negative). In this situation, and if no other classifier has clicked, we choose the class to predict
uniformly at random, which leads to an error with probability $\frac{k-1}{k}$. In the case of false positives, where $c$
classifiers click when they should not, the error rate will be only $\frac{c}{c+1}$ because we choose at random among
the classifiers which have clicked. As each binary classifier $f_j$ leads to an error rate of $\frac{k-1}{k}$ in the worst case
with probability $p_j \epsilon_j$ (where $p_j$ is the a priori probability of class $j$) and there are $k$ binary classifiers, the
global error of the classifier is upper bounded by a sum over the $k$ classifiers of terms in $p_j \epsilon_j$, which simplifies to $(k-1)\epsilon$ if all
the classes have the same a priori probability $\frac{1}{k}$ and all the binary classifiers have the same error rate $\epsilon$. (This
reduction does not seem to offer any guarantee for the regret.) □
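
The decision rule applied once the $k$ binary classifiers have been measured can be sketched as below. The function takes the list of observed "clicks" as input, which is an assumption made to keep the example independent of any particular implementation of the binary classifiers; classes are numbered from 1 to $k$ as in the text.

```python
import random


def one_against_all_classification(clicks, rng):
    """Decision rule of the one-against-all reduction.

    clicks[j] is True when the binary classifier of class j + 1 has "clicked".
    """
    clicked = [j + 1 for j, hit in enumerate(clicks) if hit]
    if len(clicked) == 1:
        return clicked[0]
    if clicked:
        # Several classifiers clicked: pick one of them at random.
        return rng.choice(clicked)
    # No classifier clicked: guess uniformly among the k classes.
    return rng.randint(1, len(clicks))
```

For instance, with `clicks = [False, True, False]` the rule returns class 2 deterministically, while the two other branches are the only sources of randomness.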

Remark 2 (Difficulty of intermediary learning situations generated by the reduction). Nothing guarantees
a priori that the intermediary learning situations generated by the reduction (for instance here the $k$ binary
classification tasks) are easy to solve. Indeed, even if access to the Helstrom oracle guarantees that the $k$ binary
classifiers will be optimal for their respective classification settings, it is possible that the observed average
error will be large. In the quantum case, it can happen for instance that the trace distance between the
density matrix of a class and the mixture composed of the union of all the other classes is low (which implies
that they are difficult to distinguish). If we have a complete classical knowledge of the quantum states, then instead
of simply deriving an upper bound on the error of the global classifier, a finer analysis will reveal the exact
training error of this classifier.

A weighted variant of the reduction, called weighted one-against-all [9], offers a better upper bound in
terms of error than the basic version. This variant exploits the fact that false negatives (not detecting the true
class) are more damaging to the error of the global classifier than false positives (predicting the wrong class).
In practice, this means that a data point will have a higher weight during the construction of the classifier of
its class. The algorithm proceeds by reducing the multiclass classification to weighted binary classification
and then uses the costing reduction [31] to reduce the weighted binary classification to the standard binary
classification. The main advantage of the weighted version of this reduction is that it offers a guarantee
on the error bound of the global classifier of $\frac{k}{2}\epsilon$, for $\epsilon$ the average error of the binary classifiers generated,
which is roughly divided by two compared to the basic version. In this case, the training cost of the weighted version
of the one-against-all reduction is $\Theta(k T t_{bin})$, where $k$ is the number of classes, $T$ the constant number of
classifiers generated by the costing reduction and $t_{bin}$ the number of copies used by each call of the Helstrom
oracle. The classification cost will be $\Theta(kT)$. Quantumly, if we know the classical description of the states of
the training dataset ($D_n \in L_{qu}^{cl}$), we can replace the costing reduction with the reduction using the Helstrom
oracle (Reduction 1), which results in a training cost of $\Theta(k t_{bin})$ and a classification cost of $\Theta(k)$.

5.3 Binary tree reductions

Another way of solving the multiclass version is to build a binary tree where each node is a binary classifier
which discriminates between two subsets of classes and where the leaves are labeled after a specific class. The
root contains the set of all classes and uses a binary classifier to divide this set into two subsets of classes of
approximately the same size. To classify an unknown state, we start from the root and go down the tree
according to the output of the binary classifier observed at each node until we reach a leaf, in which case
we predict the class associated with this leaf. There are several ways of building the binary tree (for instance
in a bottom-up or top-down fashion), which might lead to a different global error of the final classifier.
Algorithms 7 and 8 detail a possible way of recursively constructing the binary tree from the root to the
leaves, and then using it for classification.



Algorithm 7 binary_tree_training($D_n \in L_{qu}^{\otimes \Theta(t_{bin} \log k)}$)
  if all the states in $D_n$ belong to the same class then
    Create a leaf labeled according to this class
    Return
  end if
  Choose at random two subsets of classes $Y_a$ and $Y_b$ among $D_n$ such that $|Y_a| \approx |Y_b|$
  Separate the training dataset into two subsets $D_a$ and $D_b$ according to the two subsets of classes $Y_a$
  and $Y_b$ (let $\rho_a$ be the density matrix representing the subset $D_a$ and $\rho_b$ the density matrix representing
  the subset $D_b$)
  Call the Helstrom oracle to learn the binary classifier $f_{(\rho_a, \rho_b)}$ which distinguishes between the two
  density matrices $\rho_a$ and $\rho_b$
  Create a node in the binary tree whose test corresponds to the binary classifier $f_{(\rho_a, \rho_b)}$
  Call binary_tree_training($D_a$)
  Call binary_tree_training($D_b$)



Algorithm 8 binary_tree_classification($|\psi_?\rangle^{\otimes \Theta(\log k)}$, a classifier $f$ which is a binary classification tree)
  Start the traversal of the tree at the root
  while a leaf is not reached do
    Use a copy of the state $|\psi_?\rangle$ in the binary classifier corresponding to the current node
    if the classifier predicts the negative class then
      Go down the tree on the left
    else
      Go down the tree on the right
    end if
  end while
  Return the class labeled at this leaf



Reduction 4 (Reduction of multiclass classification to standard binary classification (via binary tree)).
Given access to a Helstrom oracle and a quantum training dataset $D_n \in L_{qu}^{\otimes \Theta(t_{bin} \log k)}$, it is possible to
reduce the multiclass classification task to the standard binary classification via a binary tree reduction.
Training cost: $\Theta(t_{bin} \log k)$.
Classification cost: $\Theta(\log k)$.

Proof. During the construction of the binary tree, the Helstrom oracle is called a number of times which is
directly proportional to the number of nodes in the tree. However, each call to the oracle splits the dataset
into two subsets (whose sum of sizes is equal to that of the original training dataset), which implies that at
each level of the tree the number of copies of each quantum state used by the different calls of the Helstrom
oracle is $\Theta(t_{bin})$. The global cost of the training is therefore $\Theta(t_{bin} \log k)$ because the depth of the tree is
$\Theta(\log k)$ (for $k$ the number of classes), as it is built to be balanced. The classification cost is also directly
proportional to the depth of the tree and is $\Theta(\log k)$.

The global error of the final binary tree classifier is the sum, over the classes, of the probability of making an error
on the path going from the root to the leaf of this class, which is upper bounded by $\epsilon \log k$ if, for simplicity,
we suppose that all the classes are equiprobable and that all binary classifiers have the same error $\epsilon$. Indeed,
in this case, an error can occur with probability $\epsilon$ at each node traversed, which implies that the global error
may be $\epsilon \log k$ in the worst case. □
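
The recursive construction and the traversal can be sketched as follows. The callable `train_binary(left, right)` is a hypothetical stand-in for a call to the Helstrom oracle on the two subsets of classes; it is assumed to return a function mapping a state to -1 (left subset) or +1 (right subset). Class labels are assumed not to be tuples, since tuples are used here to represent internal nodes.

```python
import random


def build_tree(classes, train_binary, rng):
    """Recursively split the list of classes into two halves of similar size."""
    if len(classes) == 1:
        return classes[0]  # leaf labeled with its class
    shuffled = rng.sample(classes, len(classes))
    half = len(shuffled) // 2
    left, right = shuffled[:half], shuffled[half:]
    test = train_binary(left, right)  # hypothetical oracle call
    return (
        test,
        build_tree(left, train_binary, rng),
        build_tree(right, train_binary, rng),
    )


def classify(tree, state):
    """Walk down the tree; return the predicted class and the copies consumed."""
    copies = 0
    while isinstance(tree, tuple):
        test, left, right = tree
        copies += 1  # one copy of the unknown state per node visited
        tree = left if test(state) == -1 else right
    return tree, copies
```

With $k = 8$ classes and an error-free binary classifier at each node, every state reaches its leaf after exactly 3 nodes, which matches the $\Theta(\log k)$ classification cost.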

Corollary 2 (State identification). Let $D_n$ be a quantum training dataset composed of $n$ pure states such
that there are no two identical states in $D_n$. A POVM exists that can identify the index of an unknown
quantum state chosen at random among the states of $D_n$ with a non-trivial accuracy given $\Theta(\log n)$
copies of this state.

Proof. The proof is relatively direct: it simply involves setting $k = n$, which means assigning a different class
to each of the $n$ points of the quantum dataset $D_n$, and applying Reduction 4. □

If we are in the situation where we have a complete knowledge of the states of the training set ($D_n \in L_{qu}^{cl}$),
it is possible to choose the two subsets of classes such that they maximize the trace distance between the
two density matrices of these subsets. In this case, it is possible to build the tree from the root to the leaves
by splitting the dataset into two subsets which maximize the trace distance. Another way of growing the
tree is by starting from the leaves and moving towards the root, where at each level we pair the classes that are the easiest to
distinguish. In particular, a reduction called "filter tree" [10] exists, which reduces the multiclass classification
to the standard binary classification (via weighted binary classification and the costing reduction [31]). This
reduction builds a multiclass classifier which has the form of a binary tree by starting from the leaves and
guarantees that the error of this classifier is upper bounded by $\epsilon \log k$, for $k$ the number of classes and $\epsilon$ the
average error of the binary classifiers generated. The strength of this reduction is that it offers a similar
guarantee for the regret (which was not the case for the algorithm binary_tree_training presented previously).
The regret of the multiclass classifier will be at most $r \log k$, for $r$ the average regret of the binary classifiers.

5.4 Pretty good measurement 

If we know the classical description of the states ($D_n \in L_{qu}^{cl}$), a general measurement strategy exists, called
the "Pretty Good Measurement" [16], which enables us to build a classifier that, given a single copy of
an unknown state $|\psi_?\rangle$, can predict the class of this state with an error bounded by the square root of the
error of the optimal classifier. (This measurement is sometimes called the "square-root measurement" in the literature, due to the explicit form of its POVM.)

Theorem 4 (Error rate of the Pretty Good Measurement [4]). Given the classical description of $k$ density
matrices $\rho_1, \ldots, \rho_k$, it is possible to build a POVM, called the Pretty Good Measurement, whose error rate
$\epsilon_{PGM}$ to distinguish between these $k$ mixed states, given a single copy $\rho_?$ of one of these states, is in the
worst case quadratically higher than the error $\epsilon_{opt}$ of the optimal POVM. Formally, this means
that:

[…] (15)

Corollary 3 (Bound on the regret of the Pretty Good Measurement). The regret of the Pretty Good Measurement
is bounded by:

[…] (16)

Montanaro [23] proved that the error of the Pretty Good Measurement is always smaller than that of the
prediction strategy that does not even measure the state, but rather chooses one of the classes at random
according to their a priori probabilities. He also derived an upper bound on the error of the Pretty Good
Measurement which depends on the fidelity between each pair of states forming the training dataset. For equiprobable states, this
bound is:

$$\epsilon_{PGM} \le 1 - \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sum_{j=1}^{n} \mathrm{Fid}(|\psi_i\rangle, |\psi_j\rangle)} \qquad (17)$$

Definition 13 (Similarity matrix of a quantum training dataset). A similarity matrix $S_n$ of a quantum
training dataset containing $n$ states is a matrix of size $n$ by $n$, where each entry $S(i,j)$ of the matrix (for
$i, j \in \{1, \ldots, n\}$) contains an estimate of the fidelity between the state $|\psi_i\rangle$ and the state $|\psi_j\rangle$.

It follows directly from the symmetry property of the fidelity that the similarity matrix is a symmetric
matrix. An efficient algorithm exists to compute this matrix, which requires only a number of copies of each
state that is linear in $n$, the number of states in the quantum training dataset. Algorithm 9 formalizes
how to compute the similarity matrix for a quantum dataset.

Algorithm 9 similarity_matrix_computation($D_n \in L_{qu}^{\otimes \Theta(en)}$)
  for $i = 1$ to $n$ do
    $S(i,i) = 1$
  end for
  for each pair $i < j$ do
    Estimate the fidelity between the two states $|\psi_i\rangle$ and $|\psi_j\rangle$ by using the C-Swap test $e$ times
    Set the estimate of $\mathrm{Fid}(|\psi_i\rangle, |\psi_j\rangle)$ to be equal to $1 - \frac{2\,\#|1\rangle}{e}$ (where $\#|1\rangle$ represents the number of
    times the result $|1\rangle$ has been observed)
    Update $S(i,j) = S(j,i) = \mathrm{Fid}(|\psi_i\rangle, |\psi_j\rangle)$
  end for
  Return $S_n$, the computed similarity matrix



Theorem 5 (Computation of the similarity matrix). It is possible to compute the similarity matrix of a
quantum dataset $D_n$ with a precision $\epsilon$, for $\epsilon = \frac{1}{\sqrt{e}}$, from $\Theta(en)$ copies of each state.

Proof. For each pair of states $(|\psi_i\rangle, |\psi_j\rangle)$ of the training dataset $D_n$, the Control-Swap test allows us to
estimate the fidelity between these states with a precision $\epsilon$, where $\epsilon = \frac{1}{\sqrt{e}}$ for $e$ the number of copies used
during the test. As the matrix $S_n$ is symmetric, the number of entries to estimate is $\Theta(\frac{n(n-1)}{2}) = \Theta(n^2)$.
Therefore, for each state we will need $\Theta(e)$ copies for each of the $n$ Control-Swap tests in which this state
appears, which makes a global cost of $\Theta(en)$ copies per state. □
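
The computation of the similarity matrix can be simulated classically when the amplitudes of the states are known. The sketch below assumes that the Control-Swap test outputs 1 with probability $(1 - \mathrm{Fid})/2$ and estimates each off-diagonal entry from `shots` repetitions; the function name and its parameters are illustrative choices rather than the paper's notation.

```python
import random


def similarity_matrix(states, shots, seed=0):
    """Estimate every pairwise fidelity from simulated Control-Swap tests."""
    rng = random.Random(seed)
    n = len(states)
    S = [[1.0] * n for _ in range(n)]  # diagonal entries stay equal to 1
    for i in range(n):
        for j in range(i + 1, n):
            overlap = sum(a.conjugate() * b for a, b in zip(states[i], states[j]))
            p_one = (1 - abs(overlap) ** 2) / 2  # chance that a test outputs 1
            ones = sum(rng.random() < p_one for _ in range(shots))
            S[i][j] = S[j][i] = 1 - 2 * ones / shots
    return S
```

Each state takes part in $n - 1$ tests of `shots` repetitions each, so the number of copies consumed per state grows linearly with both quantities, as stated in Theorem 5.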

Corollary 4 (Upper bound on the error of the Pretty Good Measurement). Given $\Theta(n)$ copies of each state
of a quantum training dataset $D_n$, it is possible to compute an upper bound on the error that the Pretty Good
Measurement will make on $D_n$.

Proof. The proof is straightforward: we only need to apply the algorithm similarity_matrix_computation and
evaluate formula 17 by taking the estimate of the fidelity between each pair of states from the corresponding
entries of the similarity matrix. □

Montanaro also gave another upper bound on the error of the Pretty Good Measurement which depends
directly on the eigenvalues of the similarity matrix $S_n$ (often called the Gram matrix in the literature, especially in classical ML). Let $\lambda_i$ be the $i$-th eigenvalue of the similarity matrix.
The error of the Pretty Good Measurement is bounded from above by:

$\epsilon_{PGM} \le 1 -$ […] (18)

This bound can also be explicitly computed from the similarity matrix $S_n$ by diagonalizing it to extract the
eigenvalues.

Regarding a lower bound on the error of the Pretty Good Measurement, a recent result [24], also due to Montanaro,
proved that this error is bounded from below by:

$$\epsilon_{PGM} \ge \sum_{i=1}^{n} \sum_{j > i} p_i \, p_j \, \mathrm{Fid}(|\psi_i\rangle, |\psi_j\rangle) \qquad (19)$$

where $p_i$ and $p_j$ are the a priori probabilities of the states $|\psi_i\rangle$ and $|\psi_j\rangle$. In the situation where all these
states are equiprobable, we can replace all the probabilities by $\frac{1}{n}$ in formula 19. Here also the bound
can be directly estimated from the similarity matrix $S_n$.

Intuitively, these bounds seem to indicate that the fidelity between pairs of states is a sufficient measure
to assess the difficulty of distinguishing the states of the training dataset. This intuition is wrong: indeed,
Jozsa and Schlienz [20] have shown that there exist situations where the fidelity between pairs of states in the
quantum dataset $D_n$ is low (which means that it is easy to discriminate one state from another), while at the
same time it is impossible to efficiently distinguish, in a global manner, one state from all the other states.

To summarize, it is possible to bound the error that the Pretty Good Measurement would make even
without explicitly constructing it. Indeed, we can bound the error of the Pretty Good Measurement given
a linear number of copies of each state of the quantum dataset, whereas if we want to explicitly build
the POVM corresponding to this measurement, all the techniques currently known seem to require a
classical description of the states (which requires a number of copies exponential in the number of qubits
if we use tomography in the case of an unknown state). An important avenue of research is whether or
not it is possible to design a learning algorithm that "learns" an approximate version of the Pretty Good
Measurement (in the same sense as the Helstrom oracle) from a finite number of copies of each state of $D_n$.

Conjecture 1 (Amount of information necessary to learn the Pretty Good Measurement). The minimal
number of copies $t_{PGM}$ of each state of the quantum training dataset $D_n$ necessary to "learn" a circuit that
could implement a non-trivial approximation of the Pretty Good Measurement is polynomial in $n$, the number
of quantum states in $D_n$, and $k$, the number of classes.

6 Discussion and conclusion 

Table 1 summarizes the training/learning and the classification costs of the different learning tasks 
and reductions that we have seen in this paper. Binary classification is the main learning primitive, as 
the weighted and the multiclass versions can be reduced to it via the Helstrom oracle. 

In practice, the Helstrom oracle will be implemented by a learning algorithm which, from a finite number 
of copies of each state from the training dataset, outputs a POVM $f$ that can act as a binary classifier. 
Contrary to the Helstrom oracle, this algorithm does not need to be optimal in terms of classification error 
as long as it offers a non-trivial precision, that is, one better than simply guessing the class of the 
unknown quantum state at random. Even in this case, almost all the reductions presented in this paper will work, 
although the global error of the generated classifier will likely be higher due to the non-optimality of the 
constructed POVM. Designing a learning algorithm that plays the role of the Helstrom oracle will enable us to 
estimate the minimum number of copies $t_{bin}$ of each state of the training dataset that is necessary to perform 
binary classification. 
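
For reference, the benchmark against which such a non-optimal classifier would be compared can be computed when the two states are classically known. The sketch below uses the standard Helstrom formula for the minimum error, $\frac{1}{2}(1 - \| p_0 \rho_0 - p_1 \rho_1 \|_1)$, and compares it with random guessing; the function name `helstrom_error` is hypothetical, and assuming known state vectors is a simplification that a learning algorithm would not enjoy.

```python
import numpy as np

def helstrom_error(psi0, psi1, p0=0.5):
    """Minimum error for discriminating two pure states with priors p0 and 1 - p0."""
    psi0 = np.asarray(psi0, dtype=complex)
    psi1 = np.asarray(psi1, dtype=complex)
    gamma = p0 * np.outer(psi0, psi0.conj()) - (1 - p0) * np.outer(psi1, psi1.conj())
    # Trace norm of a Hermitian matrix: sum of absolute eigenvalues.
    trace_norm = np.abs(np.linalg.eigvalsh(gamma)).sum()
    return 0.5 * (1.0 - float(trace_norm))

plus = np.array([1.0, 1.0]) / np.sqrt(2)
err = helstrom_error([1.0, 0.0], plus)

# Closed form for two pure states, used here as a cross-check.
overlap_sq = abs(np.vdot([1.0, 0.0], plus)) ** 2
closed_form = 0.5 * (1.0 - np.sqrt(1.0 - 4 * 0.5 * 0.5 * overlap_sq))
```

Any classifier whose error stays below 1/2 on such a pair is non-trivial in the sense used above, even if it does not reach the value returned by `helstrom_error`.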

Learning task                                     Training cost               Classification cost
Binary classification                             $\Theta(t_{bin})$           $\Theta(1)$
Weighted binary classification                    $\Theta(t_{bin})$
  (reduction via Helstrom oracle)                                             $\Theta(1)$
  (costing reduction)                                                         $\Theta(T)$
Multiclass classification
  (state identification via Control-SWAP test)    $\Theta(1)$                 $\Theta(n)$
  (one-against-all reduction)                     $\Theta(k \, t_{bin})$      $\Theta(k)$
  (binary tree reduction)                         $\Theta(t_{bin} \log k)$    $\Theta(\log k)$
  (Pretty Good Measurement)                       unknown                     $\Theta(1)$
  (Bound on the error of the PGM)                 $\Theta(n)$                 not applicable

Table 1: Training and classification costs of the different quantum learning tasks and 
reductions seen in this paper. 

The essence of ML is to learn from data coming from past experience with the hope of generalizing to 
new situations in the future. In this paper, we concentrated on the accurate classification of states coming 
from the training dataset $D_n$, but we did not discuss how this approach could generalize to previously 
unobserved quantum states. A natural way of defining that a POVM $f$ acting as a classifier generalizes is to 
require that this POVM can recognize the class of a state that is close to one of the states of the training 
dataset without being identical to it. The closeness between two pure states can be defined using the fidelity 
or other distance measures such as the Euclidean distance. 

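Definition 14 (Euclidean distance between pure states [7]). The Euclidean distance between two pure states 
$|\psi\rangle = \sum_i \alpha_i |i\rangle$ and $|\phi\rangle = \sum_i \beta_i |i\rangle$ is defined as $\mathrm{Dist}_{L2}(|\psi\rangle, |\phi\rangle) = \sqrt{\sum_i |\alpha_i - \beta_i|^2}$. 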

Bernstein and Vazirani [7] have proven that if two pure states $|\psi\rangle$ and $|\phi\rangle$ of the same dimension are within 
Euclidean distance $\epsilon$ of each other, then the same measurement performed on the two states generates samples from two 
distributions whose total variation distance is at most $4\epsilon$. Therefore, if two states are close in terms 
of their Euclidean distance, this gives a good indication that a POVM $f$ acting as a classifier will, with high 
probability, predict the same class for these two states. Future work in this model of doing machine learning 
on quantum information includes the formalization of the notions of testing and generalization error, as well 
as the study of different models of classical and quantum noise (see for instance Section 8.3 of [25] for 
different forms of quantum noise) and how they affect the robustness of the quantum learning algorithms. 
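
The relation between Euclidean distance and measurement statistics can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: the states are random vectors in dimension 8, the measurement is a projective one in a random orthonormal basis rather than a general POVM, and the total variation distance is taken as the plain sum of absolute differences. All function names are hypothetical.

```python
import numpy as np

def euclidean_distance(psi, phi):
    """L2 distance between the amplitude vectors of two pure states."""
    diff = np.asarray(psi, dtype=complex) - np.asarray(phi, dtype=complex)
    return float(np.linalg.norm(diff))

def outcome_distribution(state, basis):
    """Born probabilities when measuring in the basis given by the columns of `basis`."""
    amplitudes = basis.conj().T @ np.asarray(state, dtype=complex)
    return np.abs(amplitudes) ** 2

def total_variation(p, q):
    # Sum of absolute differences (no factor 1/2), the larger of the two common conventions.
    return float(np.abs(p - q).sum())

rng = np.random.default_rng(0)
dim = 8
psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
psi /= np.linalg.norm(psi)
# A slightly perturbed, renormalized copy of psi.
phi = psi + 0.05 * (rng.normal(size=dim) + 1j * rng.normal(size=dim))
phi /= np.linalg.norm(phi)

# Random orthonormal measurement basis obtained from a QR decomposition.
basis, _ = np.linalg.qr(rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim)))

eps = euclidean_distance(psi, phi)
tv = total_variation(outcome_distribution(psi, basis), outcome_distribution(phi, basis))
```

Comparing `tv` with `4 * eps` for different perturbation sizes gives a feel for how much slack the bound leaves in small dimensions.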

ML is a field where it is important to validate experimentally the performance of a learning algorithm 
and to compare it to other existing algorithms. Classically, numerous repositories of datasets are publicly 
available, such as the repository of the University of California at Irvine (UCI repository) or the MNIST 
database for the recognition of characters. Quantumly, once several learning algorithms have been proposed, 
it will also be important to test them experimentally on quantum datasets representing realistic situations that 
experimentalists are likely to encounter in their laboratories. The main idea would not be to create these 
datasets physically but rather to give the community access to their classical descriptions, so that anyone who 
wants to use and experiment with them using their favorite classical simulator can do so freely. An example of 
two possible classes could be entangled states versus separable states. Moreover, several situations 
that people encounter in quantum information processing can be recast naturally as a classification problem, 
such as the scenario in quantum cryptography where the eavesdropper tries to maximize his 
probability of correctly guessing the class of the state that he has intercepted. 
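
As a rough idea of what a classical description of such a dataset could look like, the following sketch generates two-qubit pure states labeled "separable" or "entangled". It is a toy construction under assumptions not made in the paper: the separable class contains random product states, the entangled class contains only maximally entangled states (local unitaries applied to a Bell state), and the labels and function names are hypothetical.

```python
import numpy as np

def random_qubit(rng):
    v = rng.normal(size=2) + 1j * rng.normal(size=2)
    return v / np.linalg.norm(v)

def random_unitary(rng):
    q, _ = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))
    return q

def separable_state(rng):
    """Product state of two random single-qubit states."""
    return np.kron(random_qubit(rng), random_qubit(rng))

def entangled_state(rng):
    """Local unitaries applied to (|00> + |11>)/sqrt(2)."""
    bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
    return np.kron(random_unitary(rng), random_unitary(rng)) @ bell

def reduced_purity(state):
    """Purity of the first qubit: 1 for product states, 1/2 for maximally entangled ones."""
    m = np.asarray(state).reshape(2, 2)  # m[a, b] is the amplitude of |a>|b>
    rho_a = m @ m.conj().T
    return float(np.real(np.trace(rho_a @ rho_a)))

def make_dataset(n_per_class, seed=0):
    rng = np.random.default_rng(seed)
    data = [(separable_state(rng), "separable") for _ in range(n_per_class)]
    data += [(entangled_state(rng), "entangled") for _ in range(n_per_class)]
    return data

dataset = make_dataset(5)
```

The helper `reduced_purity` is included only as a sanity check on the labels; a shared repository would more plausibly store the amplitude vectors and the labels, leaving the choice of simulator to the user.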
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